Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules
نویسنده
چکیده
This paper presents SoftMealy, a novel Web wrapper representation formalism. This representation is based on a finite-state transducer (FST) and contextual rules, which allow a wrapper to wrap semistructured Web pages containing missing attributes, multiple attribute values, variant attribute permutations, exceptions and typos, the features that no previous work can handle. A SoftMealy wrapper can be learned from labeled example items using a simple induction algorithm. Learnability analysis shows that SoftMealy scales well with the number of attributes and the number of different attribute permutations. Experimental results show that the learning algorithm can learn correct wrappers for a wide range of Web pages with a handful of examples and generalize well over unseen pages and structural patterns.
منابع مشابه
FINITE - STATE TRANSDUCERS FOR SEMI - STRUCTUREDDATA EXTRACTION FROM THE WEByChun
| Integrating a large number of Web information sources may signiicantly increase the utility of the WorldWide Web. A promising solution to the integration is through the use of a Web Information mediator that provides seamless, transparent access for the clients. Information mediators need wrappers to access a Web source as a structured database, but building wrappers by hand is impractical. P...
متن کاملImplementing Voting Constraints With Finite State Transducers
We describe a constraint-based morphological disambiguation system in which individual constraint rules vote on matching morphological parses followed by its implementation using finite state transducers. Voting constraint rules have a number of desirable properties: The outcome of the disambiguation is independent of the order of application of the local contextual constraint rules. Thus the r...
متن کاملReport on the Eighth International Workshop on Knowledge Representation Meets Databases ( KRDB ) , September 15 , 2001
The Eighth International Workshop on Knowledge Representation Meets Databases (KRDB) was held at the Ponti cia Universit a Urbaniana, in Rome, right after VLDB 2001. KRDB was initiated in 1994 to provide an opportunity for researchers and practitioners from the two areas to exchange ideas and results. This year's focus was on Modeling, Querying andManaging Semistructured Data. The one day progr...
متن کاملA Fuzzy Approach for Pertinent Information Extraction from Web Resources
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages. For suitable regular domains, existing wrapper induction algorithms can efficientl...
متن کاملAnalyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
متن کامل